31 research outputs found

    Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data

    Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already available publicly. However, the heterogeneity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data integration, for example through the use of common data representation formats, leaves open the larger problem: the steep learning curve required for understanding the data models of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today. In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs, through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified, while still becoming interoperable with external RDF sources. We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two comparative analyses, first within the same domain (using orthology data from multiple interoperable data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event. Finally, in order to bridge the semantic gap between users and data, we design and implement Bio-SODA, a question answering system over domain knowledge graphs that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, beyond the current state of the art in the more well-studied open-domain question answering
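The ranking idea described in this abstract (syntactic similarity, semantic similarity, and node centrality combined into one score) can be illustrated with a minimal sketch. This is not Bio-SODA's actual implementation: the `Candidate` type, the toy embeddings, the use of `difflib` for string similarity, and the weights are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    label: str         # label of a knowledge-graph node (hypothetical)
    embedding: list    # toy embedding vector (assumption, not from the source)
    centrality: float  # precomputed centrality score in [0, 1] (assumption)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(term, term_embedding, cand, w_syn=0.4, w_sem=0.4, w_cen=0.2):
    """Weighted sum of the three signals; the weights are illustrative only."""
    syntactic = SequenceMatcher(None, term.lower(), cand.label.lower()).ratio()
    semantic = cosine(term_embedding, cand.embedding)
    return w_syn * syntactic + w_sem * semantic + w_cen * cand.centrality


def rank(term, term_embedding, candidates):
    """Return candidates ordered from best to worst match for the query term."""
    return sorted(candidates, key=lambda c: score(term, term_embedding, c), reverse=True)


candidates = [
    Candidate("protein", [0.0, 1.0], 0.5),
    Candidate("genome", [0.5, 0.5], 0.3),
    Candidate("gene", [1.0, 0.0], 0.9),
]
ranked = rank("gene", [1.0, 0.0], candidates)
print([c.label for c in ranked])
```

A linear combination is only one way to merge the signals; the abstract does not say how the three are weighted.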

    On the Potential of Artificial Intelligence Chatbots for Data Exploration of Federated Bioinformatics Knowledge Graphs

    In this paper, we present work in progress on the role of artificial intelligence (AI) chatbots, such as ChatGPT, in facilitating data access to federated knowledge graphs. In particular, we provide examples from the field of bioinformatics, to illustrate the potential use of Conversational AI to describe datasets, as well as generate and explain (federated) queries across datasets for the benefit of domain experts

    Federating and querying heterogeneous and distributed Web APIs and triple stores

    Today's international corporations such as BASF, a leading company in the crop protection industry, produce and consume more and more data that are often fragmented and accessible through Web APIs. In addition, part of the proprietary and public data of interest to BASF is stored in triple stores and accessible with the SPARQL query language. Homogenizing the data access modes and the underlying semantics of the data, without modifying or replicating the original data sources, becomes an important requirement for achieving data integration and interoperability. In this work, we propose a federated data integration architecture within an industrial setup that relies on an ontology-based data access method. Our performance evaluation in terms of query response time showed that most queries can be answered in under 1 second

    A hybrid approach for alarm verification using stream processing, machine learning and text analytics

    False alarms triggered by security sensors incur high costs for all parties involved. According to police reports, a large majority of alarms are false. Recent advances in machine learning can enable automatically classifying alarms. However, building a scalable alarm verification system is a challenge, since the system needs to: (1) process thousands of alarms in real-time, (2) classify false alarms with high accuracy and (3) perform historic data analysis to enable better insights into the results for human operators. This requires a mix of machine learning, stream and batch processing – technologies which are typically optimized independently. We combine all three into a single, real-world application. This paper describes the implementation and evaluation of an alarm verification system we developed jointly with Sitasys, the market leader in alarm transmission in central Europe. Our system can process around 30K alarms per second with a verification accuracy of above 90%

    Big data architecture for intelligent maintenance : a focus on query processing and machine learning algorithms

    Exploiting available condition monitoring data of industrial machines for intelligent maintenance purposes has been attracting attention in various application fields. Machine learning algorithms for fault detection, diagnosis and prognosis are popular and easily accessible. However, our experience in working at the intersection of academia and industry showed that the major challenges of building an end-to-end system in a real-world industrial setting go beyond the design of machine learning algorithms. One of the major challenges is the design of an end-to-end data management solution that is able to efficiently store and process large amounts of heterogeneous data streams resulting from a variety of physical machines. In this paper we present the design of an end-to-end Big Data architecture that enables intelligent maintenance in a real-world industrial setting. In particular, we will discuss various physical design choices for optimizing high-dimensional queries, such as partitioning and Z-ordering, that serve as the basis for health analytics. Finally, we describe a concrete fault detection use case with two different health monitoring algorithms based on machine learning and classical statistics and discuss their advantages and disadvantages. The paper covers some of the most important aspects of the practical implementation of such an end-to-end solution and demonstrates the challenges and their mitigation for the specific application of laser cutting machines
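As a rough illustration of the Z-ordering mentioned in this abstract, the sketch below computes a two-dimensional Morton (Z-order) key by interleaving the bits of two integer coordinates and sorts points by that key, so that points close in both dimensions tend to end up close in storage order. The function name, the 16-bit width, and the sample points are assumptions for illustration; the paper's own physical design is not reproduced here.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton key.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Sorting rows by the Morton key clusters nearby (x, y) pairs together,
# which lets range filters on either column skip more data blocks.
points = [(1, 1), (0, 0), (2, 0), (0, 1), (1, 0)]
ordered = sorted(points, key=lambda p: z_order_key(p[0], p[1]))
print(ordered)
```

The same interleaving extends to more than two columns by cycling through the columns bit by bit.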

    A mobile detector for measurements of the atmospheric muon flux in underground sites

    Muons make an important contribution to the natural radiation dose in air (approx. 30 nSv/h of a total dose rate of 65-130 nSv/h), as well as in underground sites, even when the flux and relative contribution are significantly reduced. The muon flux observed underground can be used as an estimator for the depth in mwe (meter water equivalent) of the underground site. The water equivalent depth is important information for devising physics experiments feasible at a specific site. A mobile detector for performing measurements of the muon flux was developed at IFIN-HH, Bucharest. Consisting of 2 scintillator plates (approx. 0.9 m²) which measure in coincidence, the detector is installed on a van, which facilitates measurements at different locations at the surface or underground. The detector was used to determine muon fluxes at different sites in Romania. In particular, data were taken and the values of meter water equivalents were assessed for several locations in the salt mine at Slanic Prahova, Romania. The measurements were performed in 2 different galleries of the Slanic mine at different depths. In order to test the stability of the method, measurements of the muon flux at the surface at different elevations were also performed. The results were compared with predictions of Monte-Carlo simulations using the CORSIKA and MUSIC codes

    The state of the Martian climate

    The average annual surface air temperature (SAT) anomaly for 2016 for land stations north of 60°N was +2.0°C, relative to the 1981–2010 average value (Fig. 5.1). This marks a new high for the record starting in 1900, and is a significant increase over the previous highest value of +1.2°C, which was observed in 2007, 2011, and 2015. Average global annual temperatures also showed record values in 2015 and 2016. Currently, the Arctic is warming at more than twice the rate of lower latitudes

    Global Retinoblastoma Presentation and Analysis by National Income Level.

    Importance: Early diagnosis of retinoblastoma, the most common intraocular cancer, can save both a child's life and vision. However, anecdotal evidence suggests that many children across the world are diagnosed late. To our knowledge, the clinical presentation of retinoblastoma has never been assessed on a global scale. Objectives: To report the retinoblastoma stage at diagnosis in patients across the world during a single year, to investigate associations between clinical variables and national income level, and to investigate risk factors for advanced disease at diagnosis. Design, Setting, and Participants: A total of 278 retinoblastoma treatment centers were recruited from June 2017 through December 2018 to participate in a cross-sectional analysis of treatment-naive patients with retinoblastoma who were diagnosed in 2017. Main Outcomes and Measures: Age at presentation, proportion of familial history of retinoblastoma, and tumor stage and metastasis. Results: The cohort included 4351 new patients from 153 countries; the median age at diagnosis was 30.5 (interquartile range, 18.3-45.9) months, and 1976 patients (45.4%) were female. Most patients (n = 3685 [84.7%]) were from low- and middle-income countries (LMICs). Globally, the most common indication for referral was leukocoria (n = 2638 [62.8%]), followed by strabismus (n = 429 [10.2%]) and proptosis (n = 309 [7.4%]). Patients from high-income countries (HICs) were diagnosed at a median age of 14.1 months, with 656 of 666 (98.5%) patients having intraocular retinoblastoma and 2 (0.3%) having metastasis. Patients from low-income countries were diagnosed at a median age of 30.5 months, with 256 of 521 (49.1%) having extraocular retinoblastoma and 94 of 498 (18.9%) having metastasis. Lower national income level was associated with older presentation age, higher proportion of locally advanced disease and distant metastasis, and smaller proportion of familial history of retinoblastoma. 
Advanced disease at diagnosis was more common in LMICs even after adjusting for age (odds ratio for low-income countries vs upper-middle-income countries and HICs, 17.92 [95% CI, 12.94-24.80], and for lower-middle-income countries vs upper-middle-income countries and HICs, 5.74 [95% CI, 4.30-7.68]). Conclusions and Relevance: This study is estimated to have included more than half of all new retinoblastoma cases worldwide in 2017. Children from LMICs, where the main global retinoblastoma burden lies, presented at an older age with more advanced disease and demonstrated a smaller proportion of familial history of retinoblastoma, likely because many do not reach a childbearing age. Given that retinoblastoma is curable, these data are concerning and mandate intervention at national and international levels. Further studies are needed to investigate factors, other than age at presentation, that may be associated with advanced disease in LMICs

    The global retinoblastoma outcome study : a prospective, cluster-based analysis of 4064 patients from 149 countries

    DATA SHARING : The study data will become available online once all analyses are complete. BACKGROUND : Retinoblastoma is the most common intraocular cancer worldwide. There is some evidence to suggest that major differences exist in treatment outcomes for children with retinoblastoma from different regions, but these differences have not been assessed on a global scale. We aimed to report 3-year outcomes for children with retinoblastoma globally and to investigate factors associated with survival. METHODS : We did a prospective cluster-based analysis of treatment-naive patients with retinoblastoma who were diagnosed between Jan 1, 2017, and Dec 31, 2017, then treated and followed up for 3 years. Patients were recruited from 260 specialised treatment centres worldwide. Data were obtained from participating centres on primary and additional treatments, duration of follow-up, metastasis, eye globe salvage, and survival outcome. We analysed time to death and time to enucleation with Cox regression models. FINDINGS : The cohort included 4064 children from 149 countries. The median age at diagnosis was 23·2 months (IQR 11·0–36·5). Extraocular tumour spread (cT4 of the cTNMH classification) at diagnosis was reported in five (0·8%) of 636 children from high-income countries, 55 (5·4%) of 1027 children from upper-middle-income countries, 342 (19·7%) of 1738 children from lower-middle-income countries, and 196 (42·9%) of 457 children from low-income countries. Enucleation surgery was available for all children and intravenous chemotherapy was available for 4014 (98·8%) of 4064 children. The 3-year survival rate was 99·5% (95% CI 98·8–100·0) for children from high-income countries, 91·2% (89·5–93·0) for children from upper-middle-income countries, 80·3% (78·3–82·3) for children from lower-middle-income countries, and 57·3% (52·1–63·0) for children from low-income countries. On analysis, independent factors for worse survival were residence in low-income countries compared to high-income countries (hazard ratio 16·67; 95% CI 4·76–50·00), cT4 advanced tumour compared to cT1 (8·98; 4·44–18·18), and older age at diagnosis in children up to 3 years (1·38 per year; 1·23–1·56). For children aged 3–7 years, the mortality risk decreased slightly (p=0·0104 for the change in slope). INTERPRETATION : This study, estimated to include approximately half of all new retinoblastoma cases worldwide in 2017, shows profound inequity in survival of children depending on the national income level of their country of residence. In high-income countries, death from retinoblastoma is rare, whereas in low-income countries estimated 3-year survival is just over 50%. Although essential treatments are available in nearly all countries, early diagnosis and treatment in low-income countries are key to improving survival outcomes. FUNDING : The Queen Elizabeth Diamond Jubilee Trust and the Wellcome Trust. https://www.thelancet.com/journals/langlo/home

    Cyclone: Unified Stream and Batch Processing

    Due to the rising demand for large-scale data processing, there is a growing interest in both batch processing, where large volumes of data are processed offline, and stream processing, where large quantities of streaming data are processed online. The dichotomy between these vastly different computing paradigms has led to the development of substantially different methodologies and systems. As an increasing number of applications require both stream and batch processing, there is a need to bridge this gap and offer support for both paradigms. We introduce a new direction for the unification of stream and batch processing, which, contrary to other proposed approaches, uses a stream processing platform as its foundation and supports batch processing on top. Our proof-of-concept implementation of such a middleware layer, called Cyclone, offers the widely popular MapReduce programming model and translates MapReduce jobs for execution on the underlying streaming platform. Cyclone achieves a tight integration of batch and stream processing, and our evaluation further shows significant performance gains, in particular for sequential and iterative jobs, which naturally arise in many applications
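To make the idea of running a MapReduce job on top of a stream more concrete, here is a minimal single-process sketch. It is not Cyclone's API: the function names and the word-count job are hypothetical, and treating the end of the stream as the batch boundary is a simplifying assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_mapreduce_on_stream(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Consume records one at a time, group mapped values by key, reduce at end of stream."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:  # each record is handled as it arrives
        for key, value in mapper(record):
            groups[key].append(value)
    # The end of the stream plays the role of the batch boundary here.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Sum all counts collected for one key."""
    return sum(values)


stream = iter(["a b a", "b c"])
counts = run_mapreduce_on_stream(stream, word_count_mapper, sum_reducer)
print(counts)
```

A real streaming platform would distribute the per-key state across workers and could emit partial results before the stream ends; the sketch keeps everything in one process for clarity.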